Reinforcement learning-trained optimisers and Bayesian optimisation for online particle accelerator tuning

Online tuning of particle accelerators is a complex optimisation problem that continues to require manual intervention by experienced human operators. Autonomous tuning is a rapidly expanding field of research, where learning-based methods like Bayesian optimisation (BO) hold great promise in improving plant performance and reducing tuning times. At the same time, reinforcement learning (RL) is a capable method of learning intelligent controllers, and recent work shows that RL can also be used to train domain-specialised optimisers in so-called reinforcement learning-trained optimisation (RLO). In parallel efforts, both algorithms have found successful adoption in particle accelerator tuning. Here we present a comparative case study, assessing the performance of both algorithms while providing a nuanced analysis of the merits and the practical challenges involved in deploying them to real-world facilities. Our results will help practitioners choose a suitable learning-based tuning algorithm for their tuning tasks, accelerating the adoption of autonomous tuning algorithms, ultimately improving the availability of particle accelerators and pushing their operational limits.


Performance metrics box plots
The following box plots give a better sense of the distributions of the final beam errors, steps to target and steps to convergence for each algorithm.
Figure 1 shows the mean absolute error (MAE) at the end of each optimisation run for the different algorithms. It can be seen that, just like Bayesian optimisation (BO), reinforcement learning-trained optimisation (RLO) outperforms the non-learning algorithms. RLO also outperforms BO, both in simulation and in the real world, although the performance difference, while still large, is less pronounced in the real world. The latter effect is further discussed in Sec. Real-world study.
In Fig. 2, RLO is the only algorithm to reliably get within the target threshold of ε = 40 µm before the maximum number of steps, though a significant proportion of the optimisation runs performed by BO also achieve the target in time. The non-learning algorithms mostly fail to achieve the target within the maximum number of steps, with extremum seeking (ES) reaching an MAE within the threshold in a small proportion of trials, Nelder-Mead simplex achieving the target 3 times, and random search never reaching it.
Figure 3 shows box plots of the number of steps taken until the optimisations by each algorithm converged. This does not necessarily mean that they converged on the target, but rather that the MAE no longer changed by more than ε = 40 µm. We observe that all algorithms almost always converge within the allowed number of steps. Taken together with Fig. 2, this suggests that they often converge to a local optimum rather than the global one. As would be expected, this effect is especially pronounced with Nelder-Mead simplex optimisation, which converges almost as quickly as RLO and BO, but almost never gets within the threshold around the target. The effect is much more subtle with RLO and BO, suggesting that they often find the global optimum, or at least a local optimum whose objective value is close to that of the global one.
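To make the distinction between the two step metrics concrete, the following is a minimal sketch of how they could be computed from a per-step MAE trace. The function names and the exact convergence rule (no step-to-step change larger than ε) are our assumptions, not the study's published code.

```python
import numpy as np

def steps_to_target(maes, eps=40e-6):
    """First step at which the MAE (in metres) drops below the target
    threshold eps, or None if the target is never reached."""
    below = np.flatnonzero(np.asarray(maes) < eps)
    return int(below[0]) + 1 if below.size else None

def steps_to_convergence(maes, eps=40e-6):
    """Step after which the MAE never again changes by more than eps.
    Note that this detects convergence to *some* optimum; a run can
    converge in this sense without ever reaching the target."""
    deltas = np.abs(np.diff(np.asarray(maes)))
    changing = np.flatnonzero(deltas > eps)
    return int(changing[-1]) + 2 if changing.size else 1
```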

Inference times
The time it takes to infer the next set of actuator settings may also influence the choice of algorithm. For the benchmark task, the inference time happens to be negligible, because the benchmarked physical system, specifically the magnets and the beam measurement, is orders of magnitude slower than the inference. At other facilities, where the physical process takes less time, the total tuning time may be dominated by the inference time of the tuning algorithm, and there might even be real-time requirements [1].
Inference times for RLO and BO can vary greatly depending on the choice of model and other design parameters. Nevertheless, we performed inference time measurements on the specific implementations of RLO and BO used in our study. We measure the average inference times of both algorithms over the 45 000 inferences of the simulation study using a MacBook Pro with an M1 Pro chip running Python 3.9.15. We observe that BO takes an average of 0.7 s to infer the next actuator settings, while RLO is more than three orders of magnitude faster at 0.0002 s. This reflects the generally expected trend that RLO is capable of faster inference, because the RLO policy requires only a single forward pass of the multilayer perceptron (MLP), with a complexity of O(1) with respect to the number of steps taken. By contrast, each BO inference step performs a full optimisation of the acquisition function, which involves inferences with the Gaussian process (GP) model at a complexity of O(n³), scaling with the number of steps taken n. Even when choosing a different, faster model, BO requires an optimisation of the acquisition function in each step, meaning it is generally expected to have slower inference than RLO. Note that RLO inference can also be sped up significantly by using specialised hardware [2].
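As an illustration of how such measurements can be taken, the following is a minimal timing sketch for the RLO side, not the study's actual instrumentation. The layer sizes of the policy network are placeholders.

```python
import time
import torch

# Placeholder policy network; real observation and action dimensions
# depend on the tuning task.
policy = torch.nn.Sequential(
    torch.nn.Linear(13, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 5),
)

def mean_inference_time(infer, inputs):
    """Average wall-clock time of one call to `infer` per input."""
    start = time.perf_counter()
    for x in inputs:
        infer(x)
    return (time.perf_counter() - start) / len(inputs)

observations = [torch.randn(13) for _ in range(1000)]
with torch.no_grad():
    t_rlo = mean_inference_time(policy, observations)
print(f"RLO policy inference: {t_rlo * 1e3:.3f} ms per step")
```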

Bayesian optimisation implementation performance
With the number of different variants and implementations of BO that are available, it is not trivial to choose which to use for evaluations such as those presented in the main paper. We chose to use a state-of-the-art BO implementation, described in Sec. Bayesian optimisation of the main paper. To ensure that our implementation matches the state of the art that has evolved within the particle accelerator community, we compare it to the Xopt package. For benchmarking purposes, we evaluated two BO variants from the Xopt backend: one uses a hard step-size constraint, while the other uses proximal biasing [3] as a soft step-size constraint. Both use upper confidence bound (UCB) acquisition with β = 2 and perform the default pre-processing steps, i.e. normalising the input to [0, 1] and standardising the objective values. The hard variant used the same step-size limit of 0.1 as the BO and RLO versions used in this study. The proximal weight was set to 0.5, which means the acquisition drops to 1/e over 10 % of the action space. This value was obtained from hyperparameter tuning, optimising for the best MAE within 150 steps. Note that it is higher than the original value of 0.1, which was used for a much smoother objective landscape. In the studied ARES task, smaller proximal weights would mostly lead to premature convergence, due to the large number of local optima shown in Figs. 6 and 7.
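To illustrate the idea of proximal biasing, the following is a sketch of a proximally biased UCB acquisition. It assumes a squared-exponential proximal term centred on the previous point and acquisition values that are positive (e.g. after shifting); Xopt's exact parametrisation may differ, and the function name and signature are our own.

```python
import numpy as np

def proximal_ucb(mu, sigma, x, x_last, beta=2.0, lengthscale=0.5):
    """UCB acquisition weighted by a Gaussian proximal term.

    `mu` and `sigma` are the GP posterior mean and standard deviation
    at the candidate points `x`, with inputs assumed normalised to
    [0, 1] per dimension. Candidates far from the previous setting
    `x_last` are penalised, softly limiting the step size.
    """
    ucb = mu + beta * sigma
    dist_sq = np.sum((np.asarray(x) - np.asarray(x_last)) ** 2, axis=-1)
    return ucb * np.exp(-dist_sq / (2.0 * lengthscale**2))
```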
Figure 4 shows the results of our comparison. In simulation, our implementation falls right in between the two variants provided by Xopt, while on the real ARES accelerator, our implementation consistently performs the best. We therefore conclude that our implementation is representative of the state of the art in BO for particle accelerator tuning. Nevertheless, it should be mentioned that the proximal BO produced smoother actions than the other variants, which could be favourable for tasks where smooth actions are more critical.

Comparison to expert human operators
The three policies trained with different random seeds for the RLO implementation were compared to two expert human operators in a single trial, with tuning runs conducted consecutively. Note that RLO can only interact with the accelerator every 10 s to 20 s, because it has to wait for the magnets to reach their set points and for a new beam measurement to be taken before predicting the next action. The human operators are not limited by this and can already take the next action while the magnets are settling, as soon as a trend can be identified, allowing them to interact with the accelerator at more than 1 Hz. The results indicate that, while achieving a final beam similar to that of the expert human operators, RLO can do so faster and more consistently, despite having a much slower interaction rate. A plot of the MAE over the course of tuning by RLO compared to the human operators is shown in Fig. 5.

Optimisation space example
In the following two figures, we show parts of the objective space for one instance of the considered particle accelerator tuning task. This helps develop an intuition for the shape of the objective function and the way it is explored by RLO and BO. Note that the two figures show different slices of the objective function, as they are plotted with respect to the final sample found by each algorithm. It can be seen that both algorithms successfully find the optimum without exploring large regions of the objective space. Further, the fewer samples taken by RLO cover a much smaller region of the objective space than those taken by BO and trace a relatively direct line to the optimum found by RLO, supporting the assumption that RLO can use experience from training to know in which direction the optimum lies. As suggested by the slightly different performance achieved by RLO and BO, the algorithms converged on slightly different locations in the objective space. Encouragingly, the optima found by both are relatively close to each other.

Grid scans over target beams
In an effort to better understand the relationship between the environment's state and optimiser performance, grid scans were performed over the position and size of the target beam. Their results are visualised in Figs. 8 to 12. The target beams were scanned with 20 samples for each of the 4 beam parameters, resulting in 160 000 different target beams. Misalignments and incoming beams were kept constant for these scans. As expected, random search showed comparable results across all trials, regardless of the target beam parameters. For both RLO and BO, in contrast, small target beams generally resulted in slightly better MAEs. To verify that this is not a result of the random seed used when training the particular policy considered for the study presented in the main manuscript, two additional policies were trained with the same setup but different random seeds. They exhibit the same behaviour, as shown in Figs. 9 and 10. The cause of this effect is discussed in Sec. Edge cases and limitations.
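The construction of such a grid is straightforward; the following sketch reproduces the combinatorics (20 samples per parameter over 4 parameters gives 20⁴ = 160 000 targets). The parameter names and ranges are placeholders, not the values used in the study.

```python
import numpy as np
from itertools import product

# Placeholder ranges for the four target beam parameters.
n = 20
mu_x = np.linspace(-2e-3, 2e-3, n)     # horizontal position (m)
sigma_x = np.linspace(1e-5, 2e-3, n)   # horizontal beam size (m)
mu_y = np.linspace(-2e-3, 2e-3, n)     # vertical position (m)
sigma_y = np.linspace(1e-5, 2e-3, n)   # vertical beam size (m)

target_beams = list(product(mu_x, sigma_x, mu_y, sigma_y))
assert len(target_beams) == 160_000

# Misalignments and the incoming beam are held fixed; each target is
# then passed to the optimiser under test, e.g.:
# for target in target_beams:
#     mae = run_optimisation(target)   # hypothetical evaluation call
```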
Figure 1: Final beam distances achieved by different algorithms, given as the MAE to the target beam when the optimisation was terminated. The boxes show the interquartile range, with a vertical line indicating the median. The whiskers extend to 1.5 times the interquartile range; outliers beyond that range are indicated by black markers.
Figure 2: Number of steps taken by different algorithms to either reach the target within ε = 40 µm or be terminated. The boxes show the interquartile range, with a vertical line indicating the median. The whiskers extend to 1.5 times the interquartile range; outliers beyond that range are indicated by black markers. For random search, Nelder-Mead simplex, and extremum seeking, the boxes are drawn only as thin lines because almost all optimisation runs are terminated at their respective step limit.